Generation of realistic synthetic data using Multimodal Neural Ordinary Differential Equations

Individual organizations, such as hospitals, pharmaceutical companies, and health insurance providers, are currently limited in their ability to collect data that are fully representative of a disease population. This can, in turn, negatively impact the generalization ability of statistical models and scientific insights. However, sharing data across different organizations is highly restricted by legal regulations. While federated data access concepts exist, they are technically and organizationally difficult to realize. An alternative approach would be to exchange synthetic patient data instead. In this work, we introduce the Multimodal Neural Ordinary Differential Equations (MultiNODEs), a hybrid, multimodal AI approach, which allows for generating highly realistic synthetic patient trajectories on a continuous time scale, hence enabling smooth interpolation and extrapolation of clinical studies. Our proposed method can integrate both static and longitudinal data, and implicitly handles missing values. We demonstrate the capabilities of MultiNODEs by applying them to real patient-level data from two independent clinical studies and simulated epidemiological data of an infectious disease.


Additional figures for PPMI experiments
To provide an overview about more variables, we present additional plots equivalent to those in the main text below. Plots for all variables, including relevant summary statistics like those presented in Supplementary Table 1 Supplementary Figure 4: Correlation structure of the interpolated and extrapolated PPMI data (i.e., generated using only parts of the real data) in comparison to the correlation structure of the real data and that of synthetic data generated based on the complete real data.

NACC results
Below, we present the results based on the NACC dataset.  Interpolation of the NACC data was performed for year 2. For extrapolation, MultiNODEs were trained on data up to year 2 and the values for years 3 and 4 were extrapolated. Exemplary results are shown in Supplementary Figure 8. We observed that the mean JS-divergence calculated across all variables between the interpolated data and the real data was slightly higher (0.064 ± 0.054) than that of the real data and the synthetic data generated after training MultiNODEs on the complete trajectory (0.049 ± 0.026).
As in the interpolation setting, we again compared the average JS-divergence between the extrapolated data and the real data with that between the real data and synthetic data that were generated after training MultiNODEs on the complete trajectory. As expected, we could see a larger difference between the JS-divergences compared to the interpolation setting with 0.065 ± 0.024 for the extrapolated data and 0.022 ± 0.009 for the synthetic data based on the complete trajectory.

Hyperparameter optimization
To apply MultiNODEs (and the VAMBN) to data, it is necessary to perform hyperparameter optimization. In this work, we used a Bayesian hyperparameter optimization [1] as implemented in the Optuna package [2]. We used the default settings employing a Tree of Parzen Estimator. To train the VAMBN approach, it is again necessary to conduct a hyperparameter optimization. For every module of VAMBN, we performed a Bayesian hyperparameter optimization. We evaluated every configuration of hyperparameters with a 5-fold-crossvalidation using the reconstruction loss (cross entropy) as a target function. The hyperparameters were the following:

SIR Model
As parameters for the SIR model, we set T equal to 40, beta equal to 2, and gamma equal to 1. The population size was set to 1000.

Classifier training
We used the random forest implementation of sklearn with their default hyperparameters. To impute missing values in the real data, we used the missForest iterative imputer of sklearn for a maximum of 100 iterations with 10 estimators.
Supplementary Figure 13 shows the relative feature importance for the respective classifiers. Feature importance was calculated as decrease in classifier performance when the respective feature was excluded. For PPMI, the feature importance was accurately reflected in all synthetic classifiers. For NACC, we mainly saw lower importance of biological sex across all synthetic dataset-trained classifiers and an increase in importance for FAQ, a functional assessment of patients. Conclusively, the feature importance was quite stable between synthetic and real data-trained classifiers for both datasets.
Supplementary Figure 13: Feature importance for predictors used to discriminate between real healthy control subjects and real and synthetic patients, respectively. Error bars depict the standard deviation calculated over 10 repeated runs of each classifier.

Preprocessing of the PPMI data
To express the disease progression, we used z-scores normalized to a patient's baseline values. Let x t i , j long n be the value of a longitudinal variable j long ❑ at a time point t i of patient n, μ t bl , j long be the mean of a longitudinal variable at baseline and σ t bl , j long be the standard deviation of that variable at baseline. Then a z-score with respect to baseline can be defined as follows The static real valued variables were standardized, and the static categorical variables one-hot encoded.
To handle SNPs, we use CADD-filtered Impact Scores and Polygenetic Risk Scores.
We receive odds ratios for the risk SNPs from the Phenome Wide Association Studies (PheWAS) catalog of genome-wide association studies (GWAS) [3]. To compute a polygenetic risk score for every patient, we sum up the odds ratio of the occurring risk SNPs per patient. The so-called Combined Annotation Dependent Depletion (CADD) values rate the maleficence of SNPs. To get CADD-filtered impact scores, we sum up the number of occurring risk SNPs over the recommended threshold of 15 per patient.

Final model specifications
MultiNODEs implement several approaches to counteract overfitting to the training data.
We implemented drop-out layers in the encoder, used cross-validation to tune hyperparameters, and the ELBO of the variational autoencoder framework represents another form of model regularization.
The models were optimized using the default implementation of pyTorch's Adam optimizer.
Below you find the chosen hyperparameters for all MultiNODEs used in the manuscript. In the case of an Elman-network as an encoder: TanH